A short PechaKucha-style presentation of the application can be viewed here: CLICK HERE

Introduction

On July 1st, 2016, the Chicago Department of Public Health (CDPH) identified an E. coli outbreak linked to a restaurant in the Bridgeport neighborhood of Chicago. Within days, the case count swelled from the initially reported 25 to 65. The outbreak resulted in a stunning 20 hospitalizations. By the time the dust settled, over 100 people, including 40 food handlers from the restaurant, had fallen ill. The outbreak, one of the worst in the city’s recent history, highlighted the critical work of the city’s restaurant safety inspectors. Their report from the July 1st inspection identified a critical failure, noting “improper temperatures for several food items”. Their actions mitigated an outbreak that could have been much worse had it not been identified promptly.

This application presents a tool to spatially target inspections by forecasting the number of restaurant inspection failures (and thus the number of additional inspections required) in Chicago for a given year.

The city of Chicago employs approximately three dozen restaurant food inspectors. With over 15,000 inspections taking place in Chicago in 2019 alone, the sheer volume of the task necessitates prediction-aided planning. Our application allows for greater precision in anticipating possible inspection failures using machine-learning algorithms. By adopting it, the city of Chicago shortens the feedback loop in identifying critical failures, thus preventing future outbreaks. In addition to improving public health outcomes, the application improves the allocation of a limited number of inspectors, yielding greater efficiency. The built-in scheduling algorithm prioritizes inspections in high-risk areas, with mechanisms for revising inspector schedules based on the latest available data. Certain risk factors are known to significantly increase the chance of a failure; we incorporate these risks in real time.

# Create a map containing boundary of Chicago.
chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

# Create a fishnet of 500-by-500-meter squares over the boundary of Chicago.
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # clip the grid to the Chicago boundary
  st_sf() %>%
  mutate(uniqueID = rownames(.))

## Neighborhoods to use 
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

Food <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") 

FoodInspectAll <- Food %>%
    mutate(year = substr(inspection_date,1,4)) %>% 
    filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect_All")

Exploratory Analysis

For anyone wishing to replicate this analysis, we emphasize the importance of visually understanding the spatial distribution of general restaurant inspection failure counts. It is also important to know the spatial layout of independent variables—the factors we use to predict Chicago’s inspection failures. This section provides an overview of our exploratory analysis.

First, we take a look at Chicago’s 2018 inspection failures. There are tens of thousands of inspections for 2018, with just under 20% of them being failed inspections. In our case example, we predict restaurant failures for Chicago in 2019 using 2018 inspection data, so in our exploratory analysis, we consider data from 2018. A simple plot of restaurant inspection failures by location in Chicago is displayed below.
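The roughly-20% failure share quoted above can be computed directly from the Food table loaded earlier; the sketch below runs on a toy data frame whose columns mirror the Socrata feed (the real table is far larger, of course):

```r
# Toy stand-in for the Socrata Food table; the real feed has the same columns.
food_toy <- data.frame(
  inspection_date = c("2018-03-01", "2018-05-12", "2018-07-30",
                      "2017-11-02", "2018-09-14"),
  results = c("Pass", "Fail", "Fail", "Fail", "Pass w/ Conditions"),
  stringsAsFactors = FALSE
)

# Keep 2018 rows, then take the share whose result is "Fail"
is_2018    <- substr(food_toy$inspection_date, 1, 4) == "2018"
fail_share <- mean(food_toy$results[is_2018] == "Fail")
fail_share  # 0.5 on this toy data; just under 20% on the real 2018 feed
```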

FoodInspect <- Food %>%
    mutate(year = substr(inspection_date,1,4)) %>% 
    filter(year == "2018") %>%
    filter(results=="Fail") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect")

ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = FoodInspect, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Failed Food Inspections, Chicago - 2018") +
  mapTheme(title_size = 14)

Several clusters immediately stand out. The darkest cluster, to the east, is the Loop. There are also higher numbers of failed inspections along the North Shore, and to the west in Lawndale. The spatial clustering of failed inspections is a good sign that spatial analysis is appropriate in predicting failed inspections.

One of the most important spatial factors in predicting inspection failures is restaurant locations. There cannot be a failure without a restaurant to be inspected in the first place, after all. When examining all restaurant inspection locations, this time, we will map density as a layer rather than plotting each establishment individually. This avoids situations where data for restaurants in close proximity could be obscured due to overlapping, which would be more of a problem given the size of the all-restaurants dataset. Displayed below are side-by-side density maps of all restaurants logged for inspection in 2018 and restaurant inspection failures for that year.

grid.arrange(ncol=2,

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(FoodInspectAll)),
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Food Inspections Density") +
  mapTheme(title_size = 14) + theme(legend.position = "none"),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey80") +
  stat_density2d(data = data.frame(st_coordinates(FoodInspect)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Failed Food Inspection Density") +
  mapTheme(title_size = 14) + theme(legend.position = "none")
)

Brighter green represents higher density on the maps above. The two maps are very similar, a testament to the importance of restaurant locations as a predictive factor. Even on visual comparison, however, some areas show higher concentrations of failures relative to total inspections (specifically along the North Shore and in Lawndale). These patterns support the use of this application: the differences between the maps show that assigning inspectors purely based on restaurant density would not be the optimal way for the Chicago Health Department to allocate its resources. Some areas are more likely to have inspection-failing restaurants than others despite similar restaurant counts. Because each failure triggers an additional inspection, this report’s predictive tool looks to determine where failures are likely to occur.

This tool aims to take data from one year and predict failed food inspections for the following year, so it is useful to compare the locations of two annual sets of food inspections side-by-side. The maps below illustrate locations of failed inspections in 2017 and 2018.

## Get 2017 data
FoodInspect17 <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
    mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2017") %>%
    filter(results=="Fail") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect")

grid.arrange(ncol=2,
             ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = FoodInspect17, colour="red", size=0.1, show.legend = "point") +
  labs(title= "2017 Failed Food Inspections") +
  mapTheme(title_size = 14),
  
  ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = FoodInspect, colour="red", size=0.1, show.legend = "point") +
  labs(title= "2018 Failed Food Inspections") +
  mapTheme(title_size = 14)
)

Failed inspections for 2017 and 2018 look very similar, supporting the notion that one year’s failed inspections can be used to predict those of the next. Some of the differences can be attributed to the fact that a slightly different set of restaurants is inspected each year. For example, food inspections in 2019 covered 387 additional restaurants compared to 2018.

Our Spatial Process: the Fishnet

An ideal spatial process for our model must account for factors that operate at a local level, recognizing that factors affecting one restaurant are very likely to affect those nearby. Additionally, rather than predicting outcomes of individual scheduled inspections restaurant by restaurant, the spatial process should accommodate the fact that new establishments are added to the database every year.

Predicting on a fishnet grid, in which Chicago is divided into equally sized square cells, accomplishes this. The model uses the previous year’s inspections and other risk factors to generate a prediction of the number of failed inspections in each cell. The map below displays the actual count of 2018 restaurant failures per grid cell. For the purposes of this model, each cell is 500 meters on a side, a span equivalent to about 2.5 Chicago city blocks.

## Add a value of 1 for each failure, sum them with aggregate
inspect_net <- 
  dplyr::select(FoodInspect) %>% 
  mutate(countInspect = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countInspect = replace_na(countInspect, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# Plot the fishnet displaying the count of inspection failures
ggplot() +
  geom_sf(data = inspect_net, aes(fill = countInspect), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Inspection Failures for the Fishnet") +
  mapTheme()

# Create a histogram for exploration of expected prediction distribution
HistogramInspect <-
  inspect_net %>%
    group_by(uniqueID) %>%
    summarize(countInspect_sum = sum(countInspect, na.rm = T),
              countInspect_mean = mean(countInspect, na.rm = T)) %>%
  ungroup()

# plot histogram of failures
HistogramInspect %>%
  ggplot(aes(countInspect_sum)) +
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    scale_x_continuous(breaks = seq(0, 100, by = 5)) +
    labs(title="Distribution of Failed Inspection Counts", subtitle = "Chicago, IL",
         x="Number of Failed Inspection Counts", y="Number of Cells")

The histogram above illustrates the distribution of cells by their actual number of failed inspections for 2018. Of the roughly 2,500 cells that comprise Chicago’s area, just over half recorded no failed inspections. The vast majority of cells that did have failed inspections had fewer than 10, though a few individual cells had 30 or more. When our 2018 model is used to predict failures by fishnet cell for 2019, the predictions should have a similar distribution, as the spatial process’s estimates will reflect the clustered nature of restaurant inspection failures in Chicago.

Modeling Approach and Risk Factors

As stated in the introduction, the goal of the model is to spatially predict failed inspections so that an estimate of the total number of inspections can be used to allocate inspectors throughout Chicago more efficiently. To predict inspection failures for a target year, our model uses a generalized linear model. Specifically, it fits a function which, for each cell in the fishnet, takes a set of independent variables and returns a predicted number of restaurant inspection failures. The per-cell predictions used as forecasts of failed inspections are outputs of k-fold cross-validation. This is explored in more detail in the next section, but the basic idea is that the predictions created by k-fold cross-validation of one year’s data are used as predictions for the ensuing year.

This process, however, is only possible with a set of independent variables whose spatial variation collectively does well to predict restaurant inspection failures. Each independent variable, or “risk factor,” is explored in more detail in the subsections below.
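The core fitting step can be sketched in base R on simulated data; the actual regression appears later inside the crossValidate function (a Poisson glm, predicted on the response scale). All variable names in this toy example are illustrative:

```r
set.seed(42)

# Toy fishnet: 200 cells with two illustrative risk factors
cells <- data.frame(
  dist_fail_nn  = runif(200, 0, 5),   # mean distance to nearby failures
  n_inspections = rpois(200, 4)       # inspections in the cell last year
)
lambda <- exp(1 - 0.4 * cells$dist_fail_nn + 0.15 * cells$n_inspections)
cells$countFailures <- rpois(200, lambda)

# Poisson GLM: failure counts as a function of the risk factors
fit <- glm(countFailures ~ dist_fail_nn + n_inspections,
           family = "poisson", data = cells)

# Predicted failure count per cell, on the response (count) scale
cells$Prediction <- predict(fit, cells, type = "response")
head(cells$Prediction)
```

Because of the log link, predictions are always positive counts, which suits the count-per-cell outcome.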

Inspection Locations and Three Nearest Neighbors of Inspection Failure Locations

Inspection Locations - As described in the previous section, inspection locations are vital predictors of the next year’s failure locations. In order for a restaurant to fail an inspection, the inspection must take place at a location, and chances are that location is contained within the set of the previous year’s inspection locations. Like the rest of the risk factors, inspection locations are used in their fishnet form.

# Food establishment locations, 2018
foodInspectLocations <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
    mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    distinct() %>%
    mutate(Legend = "Food_Inspect_Locations") 

Inspection Failure Locations (3NN) - Inspection failure locations cannot be added as a risk factor outright, as the cross-validation in the next section seeks to optimize this exact variable. If inspection failure locations were included as-is, the model would heavily overfit its predictions to this variable, giving minimal consideration to the others. It would be more “accurate” in replicating the locations of inspection failures, but less generalizable to the following year and less useful as a predictive tool. Instead, an independent variable is introduced reflecting the mean distance from any point in Chicago to the three nearest inspection failure points. Averaged by fishnet cell, this yields a factor that helps predict inspection failure locations without causing the cross-validation to overfit.
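The 3NN calculations below rely on nn_function, which comes from the companion helper scripts and is not defined in this document. A minimal base-R sketch of what it computes (assuming projected XY coordinates, as in the fishnet’s CRS; the real helper is vectorized differently but returns the same quantity):

```r
# For each point in `measureFrom` (rows of an XY matrix), the mean Euclidean
# distance to its k nearest points in `measureTo`.
nn_sketch <- function(measureFrom, measureTo, k) {
  apply(measureFrom, 1, function(pt) {
    d <- sqrt((measureTo[, 1] - pt[1])^2 + (measureTo[, 2] - pt[2])^2)
    mean(sort(d)[1:k])
  })
}

# Two cell centroids and four failure locations, as XY matrices
centroids <- rbind(c(0, 0), c(10, 10))
failures  <- rbind(c(0, 1), c(0, 2), c(0, 3), c(10, 11))

res <- nn_sketch(centroids, failures, k = 3)
res[1]  # 2: mean of distances 1, 2, 3 to the three nearest failures
```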

# Food inspection failures, 2018
# 3NN to be performed on this risk factor later
foodInspectFail <-
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
    mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2018") %>%
    filter(results=="Fail") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Food_Inspect_Fail")

Five Risk Factors from the Chicago Open Data Portal

The following risk factors are considered potential predictors of failed food inspections, and are accessed from the Chicago Open Data Portal:

Rodent Complaints (3NN) - This variable records location-based 311 service requests for rodent baiting over a year. There were 39,023 instances in 2018.

rodentBait <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Rodent_Bait")

Sanitation (3NN) - This variable records location-based 311 service requests associated with sanitation code complaints. There were 3,316 such requests in 2018.

sanitation311 <-
  read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac")) %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  filter(what_is_the_nature_of_this_code_violation_ %in%
  c("Garbage in Yard","Garbage in Alley","Dumpster not being emptied", "Overflowing carts", "Construction Site Cleanliness/Fence", "Standing water")) %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Sanitation_311")

Ordinance Violation (3NN) - This variable refers to reported Department of Buildings ordinance violations in Chicago, of which there were 43,243 in 2018.

ordinanceViolation <- 
  read.socrata("https://data.cityofchicago.org/Administration-Finance/Ordinance-Violations-Buildings-/awqx-tuwv") %>%
    mutate(year = substr(VIOLATION.DATE,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Ordinance_Violations")

Graffiti (3NN) - This variable records all 311 service requests for the removal of graffiti in Chicago; there were 55,873 such requests in 2018.

graffiti <-
  read.socrata(paste0("https://data.cityofchicago.org/Service",
  "-Requests/311-Service-Requests-Graffiti-Removal-Historical/",
  "hec5-y4x5")) %>%
  mutate(year = substr(creation_date,1,4)) %>%
  filter(year == "2018") %>%
  filter(where_is_the_graffiti_located_ %in%
  c("Front","Rear","Side")) %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Graffiti")

Liquor Retail (3NN) - This variable derives the locations of liquor stores from business license data on the Chicago Open Data Portal.

liquorRetail <-
  read.socrata(paste0("https://data.cityofchicago.org/resource/",
  "nrmj-3kcf.json")) %>%
  filter(business_activity ==
  "Retail Sales of Packaged Liquor") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Liquor_Retail")

All five of these risk factors use three-nearest-neighbor analysis. While using these data in their raw form would not break the model’s generalizability, the three-nearest-neighbor versions tended to perform slightly better in the model. The plot below combines the five risk factors into a single panel of maps. Note that lighter colors correspond to cells farther from these factors; cells with darker colors therefore have a higher likelihood of predicted failed inspections.

# Bind the risk-factor points collected thus far and aggregate them to the fishnet
vars_net <- rbind(rodentBait, sanitation311, ordinanceViolation, graffiti, liquorRetail, foodInspectFail, foodInspectLocations)   %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet, by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit() %>%
  ungroup()

# Create long version
vars_net.long <-
  gather(vars_net, Variable, value, -geometry, -uniqueID)

# Create list of variable names
vars <- unique(vars_net.long$Variable)
mapList1 <- list()

# Convenience to reduce length of function names.
st_c    <- st_coordinates
st_coid <- st_centroid

## create NN from rodent bait
vars_net <- vars_net %>%
    mutate(rodent_Bait.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(rodentBait),
                                           k = 3))

## create NN from sanitation 311
vars_net <- vars_net %>%
    mutate(sanitation_311.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(sanitation311),
                                           k = 3))

## create NN from ordinance violations
vars_net <- vars_net %>%
    mutate(ordinance_Violation.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(ordinanceViolation),
                                           k = 3))

## create NN from graffiti
vars_net <- vars_net %>%
    mutate(graffiti.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(graffiti),
                                           k = 3))

## create NN from liquorRetail
vars_net <- vars_net %>%
    mutate(liquor_Retail.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(liquorRetail),
                                           k = 3))

vars_net.long.nn <-
  dplyr::select(vars_net, ends_with(".nn")) %>%
  gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)

mapList2 <- list()

for(i in vars){
  mapList2[[i]] <-
  ggplot() +
  geom_sf(data = filter(vars_net.long.nn, Variable == i),
  aes(fill=value), colour=NA) +
  scale_fill_viridis(name="") +
  labs(title=i) +
mapTheme()}

do.call(grid.arrange, c(mapList2, ncol = 3,
  top = "Nearest Neighbor Risk Factors by Fishnet"))

# create NN from food_Inspect_Fail
# This comes after the others so as not to appear in the group of plots
vars_net <- vars_net %>%
    mutate(food_Inspect_Fail.nn = nn_function(st_c(st_coid(vars_net)),
                                           st_c(foodInspectFail),
                                           k = 3))



# Gather all .nn vars as vars_net.long.nn
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

Additional Risk Factors

Distance to The Loop - Another factor considered by the model is the distance from Chicago’s downtown area, “The Loop.” In the Exploratory Analysis section, it was noted that The Loop has a high number of failed inspections; this variable operationalizes that observation as a risk factor. Distance to The Loop is plotted below.

Distance to Hot Spot - The final factor considered is the distance to a “hot spot” location, defined as major clusters of failed inspections in Chicago at a 99% or greater significance level. This time, the nearest neighbor function will be used to find the distance to a hot spot location.


# Distance to The Loop, measured from each fishnet cell centroid
loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net), loopPoint) %>%
  as.numeric()
  
ggplot() +
    geom_sf(data = vars_net, aes(fill=loopDistance), colour=NA) +
    scale_fill_viridis(name="loopDistance") +
    labs(title="Distance to The Loop") +
    mapTheme()

## important to drop the geometry from joining features
final_net <-
  left_join(inspect_net, st_drop_geometry(vars_net), by="uniqueID") 

## generates warnings from PROJ issues
## {spdep} builds queen-contiguity neighbor relations between polygons... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and converts them to a list of spatial weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
# print(final_net.weights, zero.policy=TRUE)

## see ?localmoran
local_morans <- localmoran(final_net$countInspect, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(countInspect = countInspect, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)


# generates warning from NN
final_net <- final_net %>% 
  mutate(Inspect.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(Inspect.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           Inspect.isSig == 1))), 
                       k = 1))

ggplot() +
      geom_sf(data = final_net, aes(fill=Inspect.isSig.dist), colour=NA) +
      scale_fill_viridis(name="Inspect.isSig.dist") +
      labs(title="NN Distance to Highly Significant \nFailed Inspect Clusters") +
      mapTheme()

Correlations

The plots below show linear correlations between the previously explored risk factors and failed inspections in Chicago.

correlation.long <-
  st_drop_geometry(final_net) %>%
  dplyr::select(-uniqueID, -cvID, -Sanitation_311, -Ordinance_Violations, -Graffiti, -Liquor_Retail, -Rodent_Bait, -Food_Inspect_Fail,  -Inspect.isSig.dist ) %>%
  gather(Variable, Value, -countInspect)

# colnames(final_net)

# , foodInspectLocations, -Inspect.isSig,-name,

correlation.cor <-
  correlation.long %>%
  group_by(Variable) %>%
  summarize(correlation = cor(Value, countInspect, use = "complete.obs"))

ggplot(correlation.long, aes(Value, countInspect)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor,
  aes(label = paste("r =", round(correlation, 2))),
  x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Failed inspection count as a function of risk factors") +
plotTheme()

Join in areal data

We use spatial joins to attach fishnet cell centroids to the neighborhood polygons. The map below shows the resulting fishnet:

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

# for live demo
mapview::mapview(final_net, zcol = "name")

Plotting local Moran’s I results

The grid map below shows the following four elements:

  1. Count of failed inspections in Chicago
  2. Local Moran’s I: dispersal or clustering of failed inspections relative to adjacent cells
  3. P-value: the statistical significance of each local statistic
  4. Significant Hotspots: major clusters of failed inspections in Chicago at the 99% significance level
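The Local Moran’s I statistic behind panels 2 through 4 can be sketched by hand. This toy example uses one common formulation (spdep::localmoran, used above, computes the same statistic and additionally supplies the p-values):

```r
# Local Moran's I for each observation:
#   I_i = ((x_i - xbar) / m2) * sum_j w_ij * (x_j - xbar),  m2 = mean((x - xbar)^2)
local_moran_sketch <- function(x, W) {
  z  <- x - mean(x)
  m2 <- mean(z^2)
  (z / m2) * as.vector(W %*% z)
}

# Four cells in a row; row-standardized contiguity weights
W <- matrix(c(0,   1,   0,   0,
              0.5, 0,   0.5, 0,
              0,   0.5, 0,   0.5,
              0,   0,   1,   0), 4, 4, byrow = TRUE)
x <- c(8, 7, 1, 0)   # failure counts: a high cluster next to a low cluster

li <- local_moran_sketch(x, W)
li  # all positive: every cell sits inside a cluster of like values
```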
## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList2 <- list()

for(i in vars){
  varList2[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange, c(varList2, ncol = 4, top = "Local Moran's I statistics, Inspect"))

Modeling with Cross Validation

As briefly outlined earlier, k-fold cross-validation not only provides a measure of the model’s performance, it also produces the actual predictions used to forecast failed inspections. For each fold, the cells outside the fold serve as a training set for a model that predicts only for the held-out cells. Carrying this out for every fold yields a predicted value for every cell.

This was not the only kind of validation used in this project, however. ‘Leave-one-group-out’ cross-validation (LOGO-CV) was also employed. LOGO-CV considers groups of data (in this case, Chicago’s neighborhoods) and trains the model on every possible combination of groups such that one group is always held out. This can help when one group has a unique relationship to the data analyzed.

Also, two versions of each cross-validation method were undertaken. “Just Risk Factors” includes only the food inspection failures, food inspection locations, and the five factors from Chicago’s open data portal, whereas “Spatial Process” adds the spatial features (hot-spot significance, distance to significant clusters, and distance to The Loop).

In total, four variations of the model are generated: the two variable sets, each with random k-fold and spatial LOGO-CV versions.

# necessary workaround for hardcoded "countBurglaries" 
# redefine function
remove(crossValidate)
crossValidate <- function(dataset, id, dependentVariable, indVariables) {

 allPredictions <- data.frame()
 cvID_list <- unique(dataset[[id]])

 for (i in cvID_list) {
 
  thisFold <- i
  #cat("This hold out fold is", thisFold, "\n")
 
  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
   dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
   dplyr::select(id, geometry, indVariables, dependentVariable)
 
  regression <-
    glm(paste0(dependentVariable,"~."), family = "poisson",
     data = fold.train %>%
      dplyr::select(-geometry, -id))
  thisPrediction <-
   mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
 
  allPredictions <-
   rbind(allPredictions, thisPrediction)
 
 }
 return(st_sf(allPredictions))
}

# View(crossValidate)

## define independent variables with spatial features ("Spatial Process")
reg.ss.vars <- c("rodent_Bait.nn", "sanitation_311.nn","ordinance_Violation.nn", "graffiti.nn", "liquor_Retail.nn", "food_Inspect_Fail.nn", "Food_Inspect_Locations", "Inspect.isSig", "Inspect.isSig.dist", "loopDistance")

## define independent variables ("Just Risk Factors")
reg.vars <-
c("rodent_Bait.nn", "sanitation_311.nn","ordinance_Violation.nn", "graffiti.nn", "liquor_Retail.nn", "food_Inspect_Fail.nn", "Food_Inspect_Locations")

## RUN REGRESSIONS
reg.ss.spatialCV <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "name",
  dependentVariable = "countFailures",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countFailures, Prediction, geometry)


reg.cv <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "cvID",
  dependentVariable = "countFailures",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countFailures, Prediction, geometry)


reg.ss.cv <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "cvID",
  dependentVariable = "countFailures",
  indVariables = reg.ss.vars) %>%
  dplyr::select(cvID = cvID, countFailures, Prediction, geometry)


reg.spatialCV <- crossValidate(
  dataset = dplyr::rename(final_net, countFailures = countInspect),
  id = "name",
  dependentVariable = "countFailures",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = name, countFailures, Prediction, geometry)
reg.summary <-rbind(
  mutate(reg.cv,
  Error = Prediction - countFailures,
  Regression = "Random k-fold CV: Just Risk Factors"),

  mutate(reg.ss.cv,
  Error = Prediction - countFailures,
  Regression = "Random k-fold CV: Spatial Process"),

  mutate(reg.spatialCV,
  Error = Prediction - countFailures,
  Regression = "Spatial LOGO-CV: Just Risk Factors"),

  mutate(reg.ss.spatialCV,
  Error = Prediction - countFailures,
  Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()


# calculate errors by fold (neighborhoods for the LOGO-CV runs)
error_by_reg_and_fold <-
  reg.summary %>%
  group_by(Regression, cvID) %>%
    summarize(Mean_Error = mean(Prediction - countFailures, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  arrange(desc(MAE))
## Simple feature collection with 396 features and 5 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 396 x 6
##    Regression                         cvID  Mean_Error   MAE SD_MAE                  geometry
##    <chr>                              <chr>      <dbl> <dbl>  <dbl>             <POLYGON [m]>
##  1 Spatial LOGO-CV: Spatial Process   Rush~       7.75  7.75   7.75 ((358164.6 581374.5, 358~
##  2 Spatial LOGO-CV: Just Risk Factors Rush~       6.89  6.89   6.89 ((358164.6 581374.5, 358~
##  3 Spatial LOGO-CV: Just Risk Factors Rive~       4.96  4.96   4.96 ((356164.6 581374.5, 356~
##  4 Spatial LOGO-CV: Just Risk Factors Ande~      -4.90  4.90   4.90 ((354664.6 590374.5, 355~
##  5 Spatial LOGO-CV: Spatial Process   Ande~      -4.59  4.59   4.59 ((354664.6 590374.5, 355~
##  6 Spatial LOGO-CV: Just Risk Factors Wrig~      -3.55  3.55   3.55 ((355664.6 586874.5, 355~
##  7 Spatial LOGO-CV: Spatial Process   Rive~       3.52  3.52   3.52 ((356164.6 581374.5, 356~
##  8 Spatial LOGO-CV: Just Risk Factors Mill~      -3.12  3.12   3.12 ((358664.6 579374.5, 359~
##  9 Spatial LOGO-CV: Spatial Process   Mill~      -3.03  3.03   3.03 ((358664.6 579374.5, 359~
## 10 Spatial LOGO-CV: Just Risk Factors Gree~       2.89  2.89   2.89 ((356664.6 578874.5, 356~
## # ... with 386 more rows
# error_by_reg_and_fold %>%
#   arrange(MAE)

## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) +
  geom_histogram(bins = 30, colour="black", fill="#FDE725FF") +
  facet_wrap(~Regression) +
  geom_vline(xintercept = 0) + scale_x_continuous(
  breaks = seq(0, 8, by = 1)) +
  labs(title="Distribution of MAE",
  subtitle = "k-fold cross-validation vs. LOGO-CV",
  x="Mean Absolute Error", y="Count") +
plotTheme()

## Errors by Model
st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>%
  summarize(Mean_MAE = round(mean(MAE), 2),
  SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(2, color = "black", background = "#FDE725FF") %>%
  row_spec(4, color = "black", background = "#FDE725FF")
Regression Mean_MAE SD_MAE
Random k-fold CV: Just Risk Factors 0.31 0.32
Random k-fold CV: Spatial Process 0.30 0.28
Spatial LOGO-CV: Just Risk Factors 0.59 1.12
Spatial LOGO-CV: Spatial Process 0.52 1.04
## neighborhood weights
neighborhood.weights <-
  filter(error_by_reg_and_fold,
         Regression == "Spatial LOGO-CV: Spatial Process") %>%
  group_by(cvID) %>%
  poly2nb(as_Spatial(.), queen=TRUE) %>%
  nb2listw(., style="W", zero.policy=TRUE)

## Moran's I on the absolute LOGO-CV errors
filter(error_by_reg_and_fold, str_detect(Regression, "LOGO")) %>%
  st_drop_geometry() %>%
  group_by(Regression) %>%
  summarize(Morans_I =
              moran.mc(abs(Mean_Error), neighborhood.weights,
                       nsim = 999, zero.policy = TRUE,
                       na.action=na.omit)[[1]],
            p_value =
              moran.mc(abs(Mean_Error), neighborhood.weights,
                       nsim = 999, zero.policy = TRUE,
                       na.action=na.omit)[[3]])
## # A tibble: 2 x 3
##   Regression                         Morans_I p_value
##   <chr>                                 <dbl>   <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors    0.141   0.023
## 2 Spatial LOGO-CV: Spatial Process      0.121   0.03